In the last two lectures, we were analyzing health care data in the United States. We were exploring the following questions:
The data is from the Henry J Kaiser Family Foundation (KFF).
Let’s recall what we did before:
Load in the tidyverse of R package
Read in the datasets with the readr R package
coverage <- read_csv("../data/KFF/healthcare-coverage.csv",
skip = 2, col_names = TRUE)
coverage <- read_csv("../data/KFF/healthcare-coverage.csv",
skip = 2, col_names = TRUE,
n_max = which(coverage$Location == "Notes")-1)
spending <- read_csv("../data/KFF/healthcare-spending.csv",
skip = 2, col_names = TRUE)
spending <- read_csv("../data/KFF/healthcare-spending.csv",
skip = 2, col_names = TRUE,
n_max = which(spending$Location == "Notes")-1)
life <- read_csv("../data/KFF/life-expectancy.csv",
skip = 2, col_names = TRUE)
life <- read_csv("../data/KFF/life-expectancy.csv",
skip = 2, col_names = TRUE,
n_max = which(life$Location == "Sources")-1)Convert the coverage and spending datasets from wide to long formats using the tidyr R package
coverage <- gather(coverage, "year_type", "tot_coverage", -Location)
spending <- gather(spending, "year", "tot_spending", -Location)Wrangle the datasets using some dplyr action verbs
coverage <-
coverage %>%
separate(year_type, sep="__", into=c("year", "type"), convert = TRUE) %>%
mutate_at("tot_coverage", as.integer)
spending <-
spending %>%
separate(year, sep="__", into=c("year", "name"), convert = TRUE) %>%
select(-name)
life <-
life %>%
rename(life_exp_years = `Life Expectancy at Birth (years)`)Add abbreviation and region information to the coverage dataset
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
state.abb <- c(state.abb, "DC")
state.region <- as.factor(c(as.character(state.region), "South"))
state.name <- c(state.name, "District of Columbia")
coverage$abb <- state.abb[match(coverage$Location, state.name)]
coverage$region <- state.region[match(coverage$Location, state.name)]Join the datasets together
hc <- inner_join(coverage, spending, by = c("Location", "year"))
hc <- left_join(hc, life, by = c("Location"))
hc <- hc %>% filter(Location != "United States")
pop <- hc %>%
filter(type == "Total") %>% select(Location, year, tot_coverage)
hc <- hc %>%
filter(type != "Total") %>%
left_join(pop, by = c("Location", "year")) %>%
rename(tot_coverage = tot_coverage.x, tot_pop = tot_coverage.y)Add created two new columns prop_coverage and spending_capita
hc <- hc %>%
mutate(prop_coverage = tot_coverage/tot_pop,
spending_capita = (tot_spending*1e6) / tot_pop)
hc## # A tibble: 612 x 11
## Location year type tot_coverage abb region tot_spending
## <chr> <int> <chr> <int> <chr> <fct> <int>
## 1 Alabama 2013 Employ… 2126500 AL South 33788
## 2 Alaska 2013 Employ… 364900 AK West 7684
## 3 Arizona 2013 Employ… 2883800 AZ West 41481
## 4 Arkansas 2013 Employ… 1128800 AR South 20500
## 5 California 2013 Employ… 17747300 CA West 278168
## 6 Colorado 2013 Employ… 2852500 CO West 34090
## 7 Connecticut 2013 Employ… 2030500 CT Northea… 34223
## 8 Delaware 2013 Employ… 473700 DE South 9038
## 9 District of Col… 2013 Employ… 324300 DC South 7443
## 10 Florida 2013 Employ… 8023400 FL South 150547
## # ... with 602 more rows, and 4 more variables: life_exp_years <dbl>,
## # tot_pop <int>, prop_coverage <dbl>, spending_capita <dbl>
Let’s learn a little bit more about ggplot2 to answer those questions using the tools we have learned in the last two lectures
ggplot2 R packageIn the previous lecture, we learned about
ggplot() functionggplot() object)geom_point()We also created our first ggplot() object to explore the question
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red")It would be nice to know which state is represented by which state. For this, we will introduce another geom called geom_text().
geom_text()In our dataset, we have information about the abbreviation for each state. We could add the abbreviations for each state next to the point on the plot to assess which states have a higher or lower coverage for a given amount of money they spend per capita.
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb))That is cool, but it would be even better if we could nudge the text over a bit. Let’s look at the help file for geom_text():
We see there is an argument called nudge_x and nudge_y. We can use these to nudge the text over a bit so the text is not directly on top of the points.
hc %>%
filter(type == "Employer", year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb), nudge_x = 150)Ok, getting back to our original question:
- Is there a relationship between healthcare coverage and healthcare spending in the United States?
We saw there was a positive relationship, but this was only for one type of healthcare coverage (Employer) and one year. What about the other types?
For this, we will introduce facets. The idea of faceting is to stratify the data by some variable and make the same plot for each strata.
For example, if we wanted to facet by the type variable, we will add a layer to our ggplot() object using the facet_grid() or facet_wrap() functions. The function expects the row and column variables to be separated by a ~.
hc %>%
filter(year == "2013") %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb), nudge_x = 150) +
facet_wrap(~type)## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 9 rows containing missing values (geom_text).
We see that the proportion of people covered have different scales in the y-axis. Let’s read the help file to see if there is some way to not restrict the y-axis to be the same.
Yes, we see there is an argument called scales that can be free_y, (free columns), free_x (free rows), and free (both). Let’s try free_y and look at a different year (year=="2014"):
hc %>%
filter(year == "2014") %>%
ggplot(aes(x = spending_capita, y = prop_coverage,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("coverage proportion") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb), nudge_x = 150) +
facet_wrap(~type, scales="free_y")## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_text).
Given we know Other Public refers to the military or Veterans Adminstration, we can see states like HI, VA, NV have a larger proportion of military or VA Other Public type coverage. While a state like AK has a similar proportion of Other Public coverage, it has a much larger spending per capita.
We also see a negative relationship with the Uninsured type. The more states spend, the less uninsured people in the state.
geom_boxplot()Next, let’s revisit the second question.
- Which US states spend the most and which spend the least on healthcare? How does the spending distribution change across geographic regions in the United States?
Let’s try making a boxplot with ggplot2. If you recall, the way to do this in base R was:
Now, we introduce the geom_boxplot() function. Note, we needed to tell ggplot2 what needs to be along the x and y axis in aes().
- How do healthcare coverage and spending relate to life expectancy?
In our last section, let’s first explore the relationship between spending and life expectancy using facets. We were faceting by healthcare type, so let’s stick with that, but change the y-axis to life_exp_years.
p <- hc %>%
filter(year == "2014") %>%
ggplot(aes(x = spending_capita, y = life_exp_years,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("life expectactancy") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb))
p + facet_wrap(~type, scales="free")Hmm… So looks like there is a positive association for all the types (more you spend the longer the life expectacy). But there is something strange about the fact that the regions have very different life expectancies.
Let’s try faceting by both region and type. Note that we can facet by rows putting a column name before the ~ and facet by columns putting a column name after the ~. We are also using facet_grid() instead of facet_wrap().
What trends do you see?
Finally, let’s do the same for the prop_coverage column instead of spending_capita column to explore the relationship between life expectancy
p <- hc %>%
filter(year == "2014") %>%
ggplot(aes(x = prop_coverage, y = life_exp_years,
color = region)) +
geom_point() +
xlab("spending per capita") +
ylab("life expectactancy") +
geom_smooth(method = "lm", col = "red") +
geom_text(aes(label=abb))
p + facet_grid(region~type, scales="free")## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
## Warning: Removed 10 rows containing missing values (geom_text).
What trends do you see? How would you summarize these results? What’s most interesting?
ggplot2 cheatsheetFor the last 20 mins of class, we will break into groups and work on a mini-case study. You can find the case study on the course website.